---
subtitle: Step-by-step guide on how to evaluate and optimize AI agents throughout their lifecycle
---

Building AI agents isn’t just about making them work: it’s about making them reliable, intelligent, and scalable.

As agents reason, act, and interact with real users, treating them like black boxes isn’t enough.

To ship production-grade agents, teams need a clear path from development to deployment, grounded in **observability, testing, and optimization**.

This guide walks you through the agent lifecycle and shows how Opik helps at every stage.

<Tip>
  In this guide, we will focus on evaluating complex agentic applications. If you are looking at evaluating single
  prompts you can refer to the [Evaluate A Prompt](/evaluation/evaluate_prompt) guide.
</Tip>

### 1. Start with Observability

The first step in agent development is making its behavior transparent. From day one, you should instrument your agent with trace logging — capturing inputs, intermediate steps, tool calls, outputs, and errors.

With **just two lines of code**, you unlock full visibility into how your agent thinks and acts. Using Opik, you can inspect every step, understand what happened, and quickly debug issues.

<Frame caption="Adding tracing to an agent">
  <img src="/img/evaluation/evaluation_agents_tracing.png" alt="Adding tracing capabilities to an AI agent" />
</Frame>

<Tip>
  This guide uses a Python agent built with LangGraph to illustrate tracing and evaluation. If you're using other
  frameworks like OpenAI Agents, CrewAI, Haystack, or LlamaIndex, you can check out our [Integrations
  Overview](/integrations/overview) to get started with tracing in your setup.
</Tip>

Once you’ve logged your first traces, Opik gives you immediate access to valuable insights, not just about what your agent did, but how it performed. You can explore detailed trace data, see how many traces and spans your agent is generating, track token usage, and monitor response latency across runs.

<Frame caption="Opik dashboard showing cost, latency, and token usage">
  <img
    src="/img/evaluation/evaluation_agents_dashboard.png"
    alt="Dashboard displaying trace data, spans, and token usage for AI agents"
  />
</Frame>

For each interaction with the end user, you can also know how the agent planned, chose tools, or crafted an answer based on the user input, the agent graph and much more.

<Frame caption="Detailed view of a single agent trace">
  <img
    src="/img/evaluation/evaluation_agents_trace.png"
    alt="Detailed visualization of a single agent trace with steps and tool interactions"
  />
</Frame>

During development phase, having access to all this information is fundamental for debugging and understanding what is working as expected and what’s not.

**Error detection**

Having immediate access to all traces that returned an error can also be life-saving, and Opik makes it extremely easy to achieve:

<Frame caption="List of traces highlighting agent errors">
  <img
    src="/img/evaluation/evaluation_agents_traces.png"
    alt="List view of multiple agent traces with error detection highlights"
  />
</Frame>

For each of the errors and exceptions captured, you have access to all the details you need to fix the issue:

<Frame caption="Error details view for a failed agent trace">
  <img
    src="/img/evaluation/evaluation_agents_errors.png"
    alt="Expanded error report showing causes and contexts of agent failures"
  />
</Frame>

### 2. Evaluate Agent's End-to-end Behavior

Once you have full visibility on the agent interactions, memory and tool usage, and you made sure everything is working at the technical level, the next logical step is to start checking the quality of the responses and the actions your agent takes.

**Human Feedback**

The fastest and easiest way to do it is providing manual human feedback. Each trace and each span can be rated “Correct” or “Incorrect” by a person (most probably you!) and that will give a baseline to understand the quality of the responses.

You can provide human feedback and a comment for each trace’s score in Opik and when you’re done you can store all results in a dataset that you will be using in next iterations of agent optimization.

<Frame caption="Human feedback for agent traces">
  <img
    src="/img/evaluation/evaluation_agents_scores.png"
    alt="Human feedback interface for agent traces with scores and comments"
  />
</Frame>

**Online evaluation**

Marking an answer as simply “correct” or “incorrect” is a useful first step, but it’s rarely enough. As your agent grows more complex, you’ll want to measure how well it performs across more nuanced dimensions.

That’s where online evaluation becomes essential.

With Opik, you can automatically score traces using a wide range of metrics, such as answer relevance, hallucination detection, agent moderation, user moderation, or even custom criteria tailored to your specific use case. These evaluations run continuously, giving you structured feedback on your agent’s quality without requiring manual review.

<Tip>
  Want to dive deeper? Check out the [Metrics Documentation](/evaluation/metrics/overview) to explore all the heuristic
  metrics and LLM-as-a-judge evaluations that Opik offers out of the box.
</Tip>

### 3. Evaluate Agent’s Steps

When building complex agents, evaluating only the final output isn't enough. Agents reason through **sequences of actions**—choosing tools, calling functions, retrieving memories, and generating intermediate messages.

Each of these **steps** can introduce errors long before they show up in the final answer.

That’s why **evaluating agent steps independently** is a core best practice.

Without step-level evaluation, you might only notice failures after they impact the final user response, without knowing where things went wrong.
With step evaluation, you can catch issues as they occur and identify exactly which part of your agent’s reasoning or architecture needs fixing.

#### **What Steps Should You Evaluate?**

Depending on your agent architecture, you might want to score:

| Step Type                 | Example Evaluation Questions                                                      |
| ------------------------- | --------------------------------------------------------------------------------- |
| **Tool Calls**            | Did the agent pick the right tool for the job? Did it provide correct parameters? |
| **Memory Retrievals**     | Was the retrieved memory relevant to the query?                                   |
| **Plans**                 | Did the agent generate a coherent, executable plan?                               |
| **Intermediate Messages** | Was the internal reasoning logical and consistent?                                |

For each of those steps you can use one of Opik’s predefined metrics or create your own custom metric that adapts to your needs.

### 4. Example: Evaluating Tool Selection Quality with Opik

When building agents that use tools (like web search, calculators, APIs…), it’s critical to know **how well your agent is choosing and using those tools**.

Are they picking the right tool? Are they using it correctly? Are they wasting time or making mistakes?

The easiest way to measure this in Opik is by running a **custom evaluation experiment**.

#### **What We'll Do**

In this example, we'll use Opik's SDK to create a **script that will run an experiment** to **measure how well an agent selects tools**.

When you run the experiment, Opik will:

- Execute the agent against every item in a dataset of examples.
- Evaluate each agent interaction using a custom metric.
- Log results (scores and reasoning) into a dashboard you can explore.

This will give you a **clear, data-driven view** of how good (or bad!) your agent’s tool selection behavior really is.

#### **What We Need**

For every Experiment we want to run, the most important elements we need to create are the following:

<Steps>
  <Step title="A Dataset">
    A set of example user queries and expected correct tool usage.
  </Step>

  <Step title="A Metric">
    A way to automatically decide if the agent’s behavior was correct or not (we’ll create a custom one).
  </Step>
  
  <Step title="An Evaluation Task">
    A function that tells Opik how to run your agent on each dataset item.
  </Step>
</Steps>

#### Full Example: Tool Selection Evaluation Script

Here’s the full example:

```python

import os
from opik import Opik
from opik.evaluation import evaluate
from agent import agent_executor
from langchain_core.messages import HumanMessage
from experiments.tool_selection_metric import ToolSelectionQuality

os.environ["OPIK_API_KEY"] = "YOUR_API_KEY"
os.environ["OPIK_WORKSPACE"] = "YOUR_WORKSPACE"

client = Opik()

# This is the dataset with the examples of good tool selection
dataset = client.get_dataset(name="Your_Dataset")

"""
Note: if you don't have a dataset yet, you can easily create it this way:
dataset = client.get_or_create_dataset(name="My_Dataset")

# Define the items
items = [
    {
        "input": "Find information about adding numbers.",
        "expected_output": "tavily_search_results_json"
    },
    {
        "input": "Multiply 7×6",
        "expected_output": "simple_math_tool"
    }
    [...]
]

# Insert the dataset items
dataset.insert(items)
"""

# This function defines how each item in the dataset will be evaluated.
# For each dataset item:
# - It sends the `input` as a message to the agent (`agent_executor`).
# - It captures the agent's actual tool calls from its outputs.
# - It packages the original input, the agent's outputs, the detected tool calls, and the expected tool calls.
# This structured output is what the evaluation platform will use to compare expected vs actual behavior using the custom metric(s) you define.

def evaluation_task(dataset_item):
    try:
        user_message_content = dataset_item["input"]
        expected_tool = dataset_item["expected_output"]

        # This is where you call your agent with the input message and get the real execution results.
        result = agent_executor.invoke({"messages": [HumanMessage(content=user_message_content)]})

        tool_calls = []

        # Here we extract the tool calls the agent actually made.
        # We loop through the agent's messages, check tool calls,
        # and for each tool call, we capture its metadata.
        for msg in result.get("messages", []):
            if hasattr(msg, "tool_calls") and msg.tool_calls:
                for tool_call in msg.tool_calls:
                    tool_calls.append({
                        "function_name": tool_call.get("name"),
                        "function_parameters": tool_call.get("args", {})
                    })

        return {
            "input": user_message_content,
            "output": result,
            "tool_calls": tool_calls,
            "expected_tool_calls": [{"function_name": expected_tool, "function_parameters": {}}]
        }

    except Exception as e:
        return {
            "input": dataset_item.get("input", {}),
            "output": "Error processing input.",
            "tool_calls": [],
            "expected_tool_calls": [{"function_name": "unknown", "function_parameters": {}}],
            "error": str(e)
        }

# This is the custom metric we have defined
metrics = [ToolSelectionQuality()]

# This function runs the full evaluation process.
# It loops over each dataset item and applies the `evaluation_task` function to generate outputs.
# It then applies the custom `ToolSelectionQuality` metric (or any provided metrics) to score each result.
# It logs the evaluation results to Opik under the specified experiment name ("AgentToolSelectionExperiment").
# This allows tracking, comparing, and analyzing your agent's tool selection quality over time in Opik.

eval_results = evaluate(
  experiment_name="AgentToolSelectionExperiment",
  dataset=dataset,
  task=evaluation_task,
  scoring_metrics=metrics
)

```

The Custom Tool Selection metric looks like this:

```python
from opik.evaluation.metrics import base_metric, score_result

class ToolSelectionQuality(base_metric.BaseMetric):
    def __init__(self, name: str = "tool_selection_quality"):
        self.name = name

    def score(self, tool_calls, expected_tool_calls, **kwargs):
        try:
            actual_tool = tool_calls[0]["function_name"]
            expected_tool = expected_tool_calls[0]["function_name"]

            if actual_tool == expected_tool:
                return score_result.ScoreResult(
                    name=self.name,
                    value=1,
                    reason=f"Correct tool selected: {actual_tool}"
                )
            else:
                return score_result.ScoreResult(
                    name=self.name,
                    value=0,
                    reason=f"Wrong tool. Expected {expected_tool}, got {actual_tool}"
                )
        except Exception as e:
            return score_result.ScoreResult(
                name=self.name,
                value=0,
                reason=f"Scoring error: {e}"
            )
```

After running this script:

- You will see a **new experiment in Opik**.
- Each item will have a **tool selection score** and a **reason** explaining why it was correct or incorrect.
- You can then **analyze results**, **filter mistakes**, and **build better training data** for your agent.

This method is a scalable way to **move from gut feelings to hard evidence** when improving your agent's behavior.

<Frame caption="Experiments Dashboard in Opik">
  <img src="/img/evaluation/evaluation_agents_experiments.png" alt="Experiments dashboard in Opik" />
</Frame>

#### What Happens Next? Iterate, Improve, and Compare

Running the experiment once gives you a **baseline**: a first measurement of how good (or bad) your agent's tool selection behavior is.

But the real power comes from **using these results to improve your agent** — and then **re-running the experiment** to measure progress.

Here’s how you can use this workflow:

<Steps>
  <Step title="Run the initial evaluation experiment">
    See where your agent is making tool selection mistakes.
  </Step>

{" "}

<Step title="Analyze the results in Opik">
  Look at the most common errors and read the reasoning behind low scores.
</Step>

{" "}

<Step title="Make improvements to your agent">
  Update the <strong>system prompt</strong> to improve instructions, refine <strong>tool descriptions</strong>, and
  adjust <strong>tool names or input formats</strong> to be more intuitive.
</Step>

{" "}

<Step title="Re-run the evaluation experiment">
  Use the same dataset to measure how your changes affected tool selection quality.
</Step>

{" "}

<Step title="Compare the results">
  Review improvements in score, spot reductions in errors, and identify new patterns or regressions.
</Step>

{" "}

<Frame caption="Comparing results between experiments">
  <img
    src="/img/evaluation/evaluation_agents_compare_experiments.png"
    alt="Comparing results between experiments in Opik"
  />
</Frame>

  <Step title="Repeat the cycle until quality is met">
    Iterate as many times as needed to reach the level of performance you want from your agent.
  </Step>
</Steps>

And this is just for one module! You can next move to the next component of your agent

**You can evaluate modules with metrics like the following:**

- **Router**: tool selection and parameter extraction
- **Tools**: Output accuracy, hallucinations
- **Planner**: Plan length, validity, sufficiency
- **Paths**: Looping, redundant steps
- **Reflection**: Output quality, retry logic

### 5. Wrapping Up: Where to Go From Here

Building great agents is a journey that doesn’t stop at getting them to “work.”
It’s about creating agents you can trust, understand, and continuously improve.

In this guide, you’ve learned how to make agent behavior observable, how to evaluate outputs and reasoning steps, and how to design experiments that drive real, measurable improvements.

But this is just the beginning.

From here, you might want to:

- Optimize your prompts to drive better agent behavior with **[Prompt Optimization](/agent_optimization/overview)**.
- Monitor agents in production to catch regressions, errors, and drift in real-time with **[Production Monitoring](/production/production_monitoring)**.
- Add **[Guardrails](/production/guardrails)** for security, content safety, and sensitive data leakage prevention, ensuring your agents behave responsibly even in dynamic environments.

Each of these steps builds on the foundation you’ve set: observability, evaluation, and continuous iteration.

By combining them, you’ll be ready to take your agents from early prototypes to production-grade systems that are powerful, safe, and scalable.
